A crash course in R
(and gretl)

Getting started

There are a lot of different ways to work with R. Here, we are going to focus on a few key things that will be helpful for your homeworks and empirical project

You can learn more about any R command by typing help(commandname). Sometimes these help files can be pretty complicated; don’t be afraid to consult me, the internet, or the resources below if you get stuck

For more detailed introductions to R’s built-in capabilities, see An Introduction to R and fasteR: Fast Lane to Learning R

For more on the Tidyverse packages (and a few other things), see R for Data Science

You can download R from the R project website (Click on Download > CRAN, choose a nearby mirror, and choose the version for your operating system)

RStudio is a program that makes it easier to work in R. RStudio is free for non-commercial use, and can be downloaded from the Posit website

Interacting with R

You can interact with R by typing commands directly into R’s “console window,” but this isn’t recommended

Instead, you should create an R script (a file that ends with .R) in RStudio that contains the commands you want to run. This will give you a reproducible record of everything you did

After you’ve typed a command, you can run it by hitting Command + Enter (macOS) or Ctrl + Enter (Windows), or by highlighting it and clicking Run (to run the entire file, click Source instead)

Basics

R can do simple calculations, like

2 + 2
[1] 4

You can store an object in the “workspace” using “<-” (the “assignment operator”):

a <- 5
a 
[1] 5
a*2
[1] 10
a^2
[1] 25

It’s more common to work with vectors, which are lists of numbers or characters

You can create them using c() (the “combine” function):

b <- c(1, 5, 10)
c <- c("Hello", ",", "how", "are", "you", "?")
b
[1]  1  5 10
b/5
[1] 0.2 1.0 2.0
c
[1] "Hello" ","     "how"   "are"   "you"   "?"    

(Note that on the left, “c” is the name of the vector, while on the right it is a function. R can tell the two apart from context, so this is ok, although reusing the names of common functions can be confusing)

Saving your work

You can save your workspace using

setwd("~/Desktop")
save.image("MyRFile.RData")

And you can open an existing one using

setwd("~/Desktop")
load("MyRFile.RData")

setwd stands for “set working directory.” You need to customize this to the location of your file

You can also do all of this from the Session menu in RStudio

Packages

Packages are collections of additional functions (and sometimes data) that extend R’s capabilities

You can install them using

install.packages("packagename")

and load them (so that R can use them) using

library(packagename)

Note that you use quotes in the first case but not the second

Reading data

R can open lots of different kinds of data. We’ll focus on CSV (comma separated value) and Excel, since these are two of the most common formats

Both of these formats are handled by the Tidyverse set of packages (readxl, which reads Excel files, is installed with the Tidyverse but has to be loaded separately)

Our dataset contains information on state-level murder rates

We can import the CSV file using

library(tidyverse)
murder <- read_csv("murder.csv")

This saves the data as a dataframe, which is basically just a list of vectors (the variables in the dataset, similar to a spreadsheet)
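To see the “list of vectors” idea concretely, here is a toy dataframe built by hand (hypothetical numbers, not part of the murder dataset):

```r
# Each column of a dataframe is just a vector
df <- data.frame(state = c("AL", "AK", "AZ"),
                 rate  = c(9.3, 10.1, 8.5))
df$rate        # pull out one column: it's an ordinary numeric vector
mean(df$rate)  # so vector functions like mean() work on it
```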

If the data are saved in Excel format, we can import them using

library(readxl)
murder <- read_excel("murder.xlsx", sheet="Sheet1")

Once we’ve imported the dataset, we can take a quick look at it using

head(murder)
# A tibble: 6 × 13
     id state  year mrdrte  exec  unem   d90   d93 cmrdrte cexec  cunem cexec_1
  <dbl> <chr> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl>  <dbl>   <dbl>
1     1 AL       87   9.30     2  7.80     0     0   NA       NA NA          NA
2     1 AL       90  11.6      5  6.80     1     0    2.30     3 -1          NA
3     1 AL       93  11.6      2  7.5      0     1    0       -3  0.700       3
4     2 AK       87  10.1      0 10.8      0     0   NA       NA NA          NA
5     2 AK       90   7.5      0  6.90     1     0   -2.60     0 -3.90       NA
6     2 AK       93   9        0  7.60     0     1    1.5      0  0.700       0
# ℹ 1 more variable: cunem_1 <dbl>

You can view the entire dataset in RStudio by typing View(murder)

You can also get a list of all of the variables using str (which stands for “structure”):

str(murder)
tibble [153 × 13] (S3: tbl_df/tbl/data.frame)
 $ id     : num [1:153] 1 1 1 2 2 2 3 3 3 4 ...
 $ state  : chr [1:153] "AL" "AL" "AL" "AK" ...
 $ year   : num [1:153] 87 90 93 87 90 93 87 90 93 87 ...
 $ mrdrte : num [1:153] 9.3 11.6 11.6 10.1 7.5 ...
 $ exec   : num [1:153] 2 5 2 0 0 0 0 0 3 0 ...
 $ unem   : num [1:153] 7.8 6.8 7.5 10.8 6.9 ...
 $ d90    : num [1:153] 0 1 0 0 1 0 0 1 0 0 ...
 $ d93    : num [1:153] 0 0 1 0 0 1 0 0 1 0 ...
 $ cmrdrte: num [1:153] NA 2.3 0 NA -2.6 ...
 $ cexec  : num [1:153] NA 3 -3 NA 0 0 NA 0 3 NA ...
 $ cunem  : num [1:153] NA -1 0.7 NA -3.9 ...
 $ cexec_1: num [1:153] NA NA 3 NA NA 0 NA NA 0 NA ...
 $ cunem_1: num [1:153] NA NA -1 NA NA ...

Visualizing data

For more complex graphs, I recommend using the ggplot2 package, which is part of the Tidyverse

We can look at the distribution of murder rates using:

ggplot(murder, aes(x = mrdrte)) + geom_histogram()

aes(x = mrdrte) (the “aesthetic”) tells R which variable(s) we’re graphing

geom_histogram (the “geometric object”) tells R what kind of graph we want

We could look at the relationship between murder rates and executions using:

ggplot(murder, aes(x = exec, y = mrdrte)) + geom_point()

We can use additional “geoms” to add more to the graph. For example, to add a regression line, we can use:

ggplot(murder, aes(x = exec, y = mrdrte)) + 
  geom_point() + geom_smooth(method="lm")

If we wanted a different graph for each year, we could add facet_wrap (the scales = "free" option tells R to use different axes for each year):

ggplot(murder, aes(x = exec, y = mrdrte)) + 
  geom_point() + geom_smooth(method="lm") +
  facet_wrap(~ year, scales="free")

Summarizing data

You can obtain very simple summary statistics using summary:

summary(murder)
       id        state                year        mrdrte            exec       
 Min.   : 1   Length:153         Min.   :87   Min.   : 0.800   Min.   : 0.000  
 1st Qu.:13   Class :character   1st Qu.:87   1st Qu.: 3.900   1st Qu.: 0.000  
 Median :26   Mode  :character   Median :90   Median : 6.400   Median : 0.000  
 Mean   :26                      Mean   :90   Mean   : 8.071   Mean   : 1.229  
 3rd Qu.:39                      3rd Qu.:93   3rd Qu.:10.200   3rd Qu.: 1.000  
 Max.   :51                      Max.   :93   Max.   :78.500   Max.   :34.000  
                                                                               
      unem             d90              d93            cmrdrte       
 Min.   : 2.200   Min.   :0.0000   Min.   :0.0000   Min.   :-2.6000  
 1st Qu.: 4.900   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:-0.4000  
 Median : 5.800   Median :0.0000   Median :0.0000   Median : 0.3000  
 Mean   : 5.973   Mean   :0.3333   Mean   :0.3333   Mean   : 0.8422  
 3rd Qu.: 7.000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.: 1.3000  
 Max.   :12.000   Max.   :1.0000   Max.   :1.0000   Max.   :41.6000  
                                                    NA's   :51       
     cexec              cunem              cexec_1            cunem_1       
 Min.   :-11.0000   Min.   :-5.800000   Min.   :-11.0000   Min.   :-5.8000  
 1st Qu.:  0.0000   1st Qu.:-1.075000   1st Qu.:  0.0000   1st Qu.:-1.9500  
 Median :  0.0000   Median : 0.300000   Median :  0.0000   Median :-1.0000  
 Mean   :  0.1863   Mean   : 0.005882   Mean   : -0.2745   Mean   :-0.8863  
 3rd Qu.:  0.0000   3rd Qu.: 1.000000   3rd Qu.:  0.0000   3rd Qu.: 0.0000  
 Max.   : 23.0000   Max.   : 3.600000   Max.   :  5.0000   Max.   : 3.1000  
 NA's   :51         NA's   :51          NA's   :102        NA's   :102      

Another way to get quick summary statistics is using the describe function from the psych package (the :: notation is a way to use a package without loading it):

psych::describe(murder)
        vars   n  mean    sd median trimmed   mad   min  max range  skew
id         1 153 26.00 14.77   26.0   26.00 19.27   1.0 51.0  50.0  0.00
state*     2 153 26.00 14.77   26.0   26.00 19.27   1.0 51.0  50.0  0.00
year       3 153 90.00  2.46   90.0   90.00  4.45  87.0 93.0   6.0  0.00
mrdrte     4 153  8.07  9.19    6.4    6.88  4.45   0.8 78.5  77.7  5.95
exec       5 153  1.23  3.79    0.0    0.35  0.00   0.0 34.0  34.0  5.72
unem       6 153  5.97  1.68    5.8    5.89  1.48   2.2 12.0   9.8  0.65
d90        7 153  0.33  0.47    0.0    0.29  0.00   0.0  1.0   1.0  0.70
d93        8 153  0.33  0.47    0.0    0.29  0.00   0.0  1.0   1.0  0.70
cmrdrte    9 102  0.84  4.29    0.3    0.41  1.33  -2.6 41.6  44.2  8.40
cexec     10 102  0.19  2.95    0.0    0.04  0.00 -11.0 23.0  34.0  4.02
cunem     11 102  0.01  1.66    0.3    0.05  1.41  -5.8  3.6   9.4 -0.47
cexec_1   12  51 -0.27  2.19    0.0   -0.05  0.00 -11.0  5.0  16.0 -2.67
cunem_1   13  51 -0.89  1.73   -1.0   -0.95  1.63  -5.8  3.1   8.9  0.12
        kurtosis   se
id         -1.22 1.19
state*     -1.22 1.19
year       -1.52 0.20
mrdrte     41.80 0.74
exec       40.36 0.31
unem        1.06 0.14
d90        -1.52 0.04
d93        -1.52 0.04
cmrdrte    76.90 0.42
cexec      35.09 0.29
cunem       0.55 0.16
cexec_1    11.22 0.31
cunem_1     0.47 0.24

You can get customized summaries using the Tidyverse packages

To get the means and standard deviations of the murder and execution rates, along with the number of observations, we can use

murder |> summarize(mean_mrdrte = mean(mrdrte), 
                    sd_mrdrte = sd(mrdrte),
                    mean_exec = mean(exec),
                    sd_exec = sd(exec),
                    n = n()) 
# A tibble: 1 × 5
  mean_mrdrte sd_mrdrte mean_exec sd_exec     n
        <dbl>     <dbl>     <dbl>   <dbl> <int>
1        8.07      9.19      1.23    3.79   153

“|>” is the pipe operator, which passes the dataframe on its left into the function on its right (so we don’t need the $ syntax)
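The pipe simply feeds whatever is on its left into the function on its right, so x |> f() means the same thing as f(x). A small illustration with a plain vector (note that the |> pipe requires R 4.1 or later):

```r
# These two lines are equivalent
sqrt(c(1, 4, 9))      # [1] 1 2 3
c(1, 4, 9) |> sqrt()  # [1] 1 2 3
```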

If we only wanted to do this for a certain year, we could combine this with filter:

murder |> filter(year==87) |>
  summarize(mean_mrdrte = mean(mrdrte), 
            sd_mrdrte = sd(mrdrte),
            mean_exec = mean(exec),
            sd_exec = sd(exec),
            n = n()) 
# A tibble: 1 × 5
  mean_mrdrte sd_mrdrte mean_exec sd_exec     n
        <dbl>     <dbl>     <dbl>   <dbl> <int>
1        7.04      5.22      1.20    3.62    51

Note that if our command spans multiple lines, we have to put |> at the end of a line, not at the start of the next one

If we wanted to know these statistics for each year, we could use group_by:

murder |> group_by(year) |>
 summarize(mean_mrdrte = mean(mrdrte), 
           sd_mrdrte = sd(mrdrte),
           mean_exec = mean(exec),
           sd_exec = sd(exec),
           n = n())  
# A tibble: 3 × 6
   year mean_mrdrte sd_mrdrte mean_exec sd_exec     n
  <dbl>       <dbl>     <dbl>     <dbl>   <dbl> <int>
1    87        7.04      5.22     1.20     3.62    51
2    90        8.44     10.6      0.922    2.18    51
3    93        8.73     10.7      1.57     5.06    51

Transforming data

You can add a new variable to a dataframe using

murder$unem_sq <- murder$unem^2

The $ tells R that the variable is part of the dataframe murder

If you are going to make a lot of new variables, it can be easier to use mutate from the Tidyverse packages:

murder <- murder |> mutate(unem_sq = unem^2,
                           exec_sq = exec^2,
                           year90 = (year==90),
                           year93 = (year==93))

The |> syntax is the pipe operator, which tells R which dataframe you are working with

The (year == 90) expression acts like an indicator function: it is TRUE when the condition in parentheses holds and FALSE otherwise (R treats TRUE as one and FALSE as zero in calculations)

Note that we use double equals signs whenever evaluating whether a condition is true
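To see the difference: a single = (or <-) assigns a value, while == compares two values and returns TRUE or FALSE:

```r
x <- 90               # assignment: x now holds 90
x == 90               # comparison: TRUE
x == 87               # comparison: FALSE
c(87, 90, 93) == 90   # comparisons work element-by-element: FALSE TRUE FALSE
```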

We could change the values of the year variables using:

murder <- murder |> mutate(year = case_when(year == 87 ~ 1987, 
                                            year == 90 ~ 1990,
                                            year == 93 ~ 1993))
murder |> head()
# A tibble: 6 × 17
     id state  year mrdrte  exec  unem   d90   d93 cmrdrte cexec  cunem cexec_1
  <dbl> <chr> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl>  <dbl>   <dbl>
1     1 AL     1987   9.30     2  7.80     0     0   NA       NA NA          NA
2     1 AL     1990  11.6      5  6.80     1     0    2.30     3 -1          NA
3     1 AL     1993  11.6      2  7.5      0     1    0       -3  0.700       3
4     2 AK     1987  10.1      0 10.8      0     0   NA       NA NA          NA
5     2 AK     1990   7.5      0  6.90     1     0   -2.60     0 -3.90       NA
6     2 AK     1993   9        0  7.60     0     1    1.5      0  0.700       0
# ℹ 5 more variables: cunem_1 <dbl>, unem_sq <dbl>, exec_sq <dbl>,
#   year90 <lgl>, year93 <lgl>

Merging data

Suppose that you were working with the murder data, but wanted to add state-by-year-level variables from another source. How could you merge these two sources?

First, let me create a fake dataset to merge in:

fake <- murder |> select(state, year) |> mutate(newvar = rnorm(n()))

rnorm(n()) creates a normally distributed variable with the same number of observations as our dataset
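Because rnorm draws random numbers, you will get different values each time you run this. Calling set.seed first makes the draws reproducible (the seed value 123 here is arbitrary):

```r
set.seed(123)  # fix the random number generator's starting point
rnorm(3)       # three standard-normal draws
set.seed(123)  # reset to the same starting point...
rnorm(3)       # ...and get the same three draws again
```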

Now we can merge these data using left_join:

murder <- murder |> left_join(fake, join_by(state, year))
murder |> select(state, year, mrdrte, newvar) |> head()
# A tibble: 6 × 4
  state  year mrdrte  newvar
  <chr> <dbl>  <dbl>   <dbl>
1 AL     1987   9.30  0.457 
2 AL     1990  11.6  -1.51  
3 AL     1993  11.6   0.0975
4 AK     1987  10.1   0.718 
5 AK     1990   7.5   0.322 
6 AK     1993   9    -0.934 

This is called a “left join” because it always keeps the original data, even if it doesn’t get matched to the new dataset (there are other types of joins that R can do, but this is the most common)
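To see the difference between join types, here is a toy example with made-up data (not the murder dataset): left_join keeps every row of the first dataframe, while inner_join keeps only the rows that match in both:

```r
library(dplyr)

left  <- tibble(state = c("AL", "AK", "AZ"), x = 1:3)
right <- tibble(state = c("AL", "AK"), y = c(10, 20))

left_join(left, right, join_by(state))   # 3 rows; AZ gets y = NA
inner_join(left, right, join_by(state))  # 2 rows; AZ is dropped
```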

More

We will see examples of other techniques as the course progresses

For example, if we want to “run a regression” of murder rates on execution rates, we can use the “linear model” function lm (we’ll learn what this means later):

model <- lm(mrdrte ~ exec, murder)

This saves the result under model. To view the results, we can use

summary(model)

Call:
lm(formula = mrdrte ~ exec, data = murder)

Residuals:
   Min     1Q Median     3Q    Max 
-6.966 -3.866 -1.566  1.898 70.734 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   7.7658     0.7800   9.957   <2e-16 ***
exec          0.2481     0.1963   1.264    0.208    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.175 on 151 degrees of freedom
Multiple R-squared:  0.01047,   Adjusted R-squared:  0.003914 
F-statistic: 1.597 on 1 and 151 DF,  p-value: 0.2082

Quarto

Quarto allows you to write documents and presentations using R and RStudio

I used Quarto to make these slides. You can use it for your homeworks and empirical presentation (if you want, you don’t have to)

To create a Quarto document, select File > New File > Quarto Document

To create a presentation, select Quarto Presentation instead

You can click Render to turn your file into an HTML, PDF, Word or PowerPoint file

The basic syntax looks like this:


# Header

Some text, *some italic text*, **some bold text**

* Bulleted
* List

```{r}
R commands here
```

---

This is a new slide...

If you prefer, it’s perfectly fine to work in Word, Powerpoint, Google Docs, etc. instead

If you do, when you are pasting R output into your file, please use a monospaced font (Courier New, Consolas, etc.) so that everything lines up correctly

gretl basics

If you hate the idea of coding in R (understandable), gretl allows you to do many (but not all) of the same things using a graphical interface

Gretl is free software that works on all major platforms

You can download it from the gretl website

You can import data using File > Open Data > User File, then selecting the file type (csv, Excel, …). Gretl might ask you some questions about the format of the data after you do this

Note: You can use File > Open Data > Sample File to see some sample datasets that might be helpful for your empirical projects

You can get basic descriptive statistics by clicking on View > Summary statistics

gretl can do more advanced summary statistics, but it requires coding that isn’t any easier than in R

You can plot a histogram by going to Variable > Frequency distribution

You can make a scatter plot by selecting View > Graph specified vars > X-Y scatter

You can add transformations of variables using the Add menu (the Define new variable option lets you use arbitrary expressions, like z = mrdrte + exec for the sum of the murder and execution rates)

You can run a regression by going to Model > Ordinary Least Squares, selecting the dependent and independent variables, and clicking ok

You can paste the output into Word, PowerPoint, etc.

From the output window, you can also do things like modify the model, run additional tests (that we will discuss later in the semester) or plot the predictions from the model

You can go to File > Save to session as icon to save the results for future reference

You can save your entire gretl session by going to File > Session files > Save session

Here is what a gretl session looks like: